
PatchedArray: basics and wiring #7002

Open
a10y wants to merge 27 commits into develop from aduffy/patched-array

Conversation

@a10y
Contributor

@a10y a10y commented Mar 17, 2026

Summary

This is the first PR in a series addressing the PatchedArray RFC: vortex-data/rfcs#27

This PR adds a new PatchedArray array variant, which is slated to be used only with FastLanes array types. The design is largely documented in the RFC; briefly, it:

  • Wraps another child array
  • Holds a buffer of patch offsets and patch values, both sorted by chunk and lane, so there is efficient random as well as bulk access from CUDA

We are able to push down the following at reduce time:

  • Filters push through to the child, and we trim the patch indices to the range of chunks covered by the filter
  • Slicing, which reduces the chunk range
  • Compare always pushes through to the child
  • Take pushes through to the inner array, then constructs a hashmap of the patches

There will be follow-ups to add the wiring into CUDA and to update how BitPacked and ALP arrays are written.

Testing

There are unit tests for all of the reducers and kernels

@codspeed-hq

codspeed-hq bot commented Mar 17, 2026

Merging this PR will degrade performance by 24.45%

❌ 9 regressed benchmarks
✅ 1097 untouched benchmarks
⏩ 1522 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation old_alp_prim_test_between[f32, 32768] 398.1 µs 444.5 µs -10.44%
Simulation new_bp_prim_test_between[i32, 16384] 101.5 µs 116.2 µs -12.67%
Simulation take_10k_contiguous 270.9 µs 329 µs -17.65%
Simulation new_bp_prim_test_between[i32, 32768] 149.1 µs 178.3 µs -16.38%
Simulation new_bp_prim_test_between[i64, 16384] 118.3 µs 147.9 µs -20.03%
Simulation take_10k_random 209.1 µs 267.3 µs -21.78%
Simulation new_bp_prim_test_between[i64, 32768] 182.4 µs 241.4 µs -24.45%
Simulation new_alp_prim_test_between[f64, 16384] 131.2 µs 153.3 µs -14.45%
Simulation old_bp_prim_test_between[i64, 32768] 473.4 µs 535.5 µs -11.6%

Comparing aduffy/patched-array (5412a08) with develop (3c5fbd3)

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y a10y added the changelog/feature A new feature label Mar 18, 2026
@a10y a10y marked this pull request as ready for review March 18, 2026 21:21
@a10y a10y requested a review from joseph-isaacs March 18, 2026 21:21
@a10y a10y force-pushed the aduffy/patched-array branch from c3d7434 to 6c1d7aa Compare March 19, 2026 13:24
@a10y a10y force-pushed the aduffy/patched-array branch from 860cc64 to 042394b Compare March 24, 2026 19:17
@a10y a10y force-pushed the aduffy/patched-array branch 4 times, most recently from df8dbf5 to 862d814 Compare March 26, 2026 19:54
@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude review please

@claude

This comment was marked as resolved.

@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude i have addressed your main comments. please give another review.

@claude

This comment was marked as resolved.

@a10y a10y force-pushed the aduffy/patched-array branch from daf258f to 578d83d Compare March 27, 2026 14:35
@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude please review

@claude

This comment was marked as resolved.

@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude PR is ready for another review

@claude
Contributor

claude bot commented Mar 27, 2026

Claude finished @a10y's task in 8m 28s


PR Review: PatchedArray basics and wiring

Tasks:

  • Read PR context and understand the changes
  • Read core array implementation files
  • Read compute kernel implementations
  • Read vtable implementation
  • Read test/compat fixtures
  • Provide comprehensive review feedback

Overall this is solid work — the design is clean, the test coverage is good, and the key operations (slice, filter, take, compare, scalar_at) are all well-structured. I found three correctness bugs of varying severity. Details below.


🔴 Bug 1: build() computes n_chunks incorrectly for sliced arrays

File: vortex-array/src/arrays/patched/vtable/mod.rs, line 246

let n_chunks = len.div_ceil(1024);  // WRONG for offset > 0

When a PatchedArray is sliced at a non-chunk boundary, array.offset > 0. For example, a slice from element 5 to element 1025 produces len = 1020, offset = 5, but the array spans 2 chunks. The lane_offsets buffer has n_chunks_actual * n_lanes + 1 = 2 * n_lanes + 1 entries, but build() reconstructs with n_chunks = 1020.div_ceil(1024) = 1, causing all downstream chunk/lane indexing to be wrong after a round-trip through serialization.

Fix: let n_chunks = (len + metadata.offset as usize).div_ceil(1024);

This matches how slice_chunks counts chunks and how SliceReduce selects chunk boundaries. All the invariants hold because n_chunks is always derivable from (len + offset).



🔴 Bug 2: Missing upper-bounds check in compare.rs's ApplyPatches::apply

File: vortex-array/src/arrays/patched/compute/compare.rs, lines 139–149

let bit_index = chunk * 1024 + patch_index as usize;
if bit_index < self.offset {
    continue;  // lower-bound guard ✓
}
let bit_index = bit_index - self.offset;
// No upper-bound check — can write past end of bits buffer!
if cmp(patch_value, self.constant) {
    self.bits.set(bit_index)
} else {
    self.bits.unset(bit_index)
}

The bits buffer has lhs.len bits. A patch in the final chunk at a position beyond offset + len (which can easily happen when len is not a multiple of 1024 — e.g., 2048-element array with a patch at index 2047 sliced to [0, 2040)) causes bit_index = 2047 - 0 = 2047 >= 2040, an out-of-bounds write.

Fix: Add if bit_index >= self.bits.len() { continue; } (or use self.offset + lhs_len as an explicit upper bound) after computing bit_index - self.offset. The equivalent guard already exists in apply_patches_primitive (line 369 of vtable/mod.rs).



🔴 Bug 3: Double-offset in slice_chunks when chunks.start == 0 and offset > 0

File: vortex-array/src/arrays/patched/array.rs, lines 138–168

let begin = (chunks.start * 1024).max(self.offset);  // = self.offset when start=0
let inner = self.inner.slice(begin..end)?;            // inner[0] = original[self.offset]
let offset = begin % 1024;                            // = self.offset again

When chunks.start == 0 and self.offset > 0 (i.e. filtering a non-chunk-boundary-sliced array that doesn't prune front chunks), the inner gets sliced by self.offset elements AND the new offset field is still self.offset. During execute(), buffer.slice_typed(offset..offset+len) then skips another self.offset elements, effectively double-counting.

This path is reachable via filter.rs: when chunk_start == 0 && chunk_stop < n_chunks, slice_chunks(0..chunk_stop) is called. The existing test_filter_sliced test avoids this because it slices at chunk boundary 1024 (offset=0).

Fix: Distinguish the two cases:

  • When chunks.start > 0: begin = chunks.start * 1024, new offset = 0 (chunk-aligned start)
  • When chunks.start == 0: begin = 0, new offset = self.offset, new len = min(chunks.end * 1024, self.offset + self.len) - self.offset



🟡 Memory: slice_chunks retains all patch data

File: array.rs, lines 145–146

let indices = self.indices.clone();
let values = self.values.clone();

After slice_chunks, the indices and values arrays still hold patches for the entire original array, not just the retained chunk range. The lane_offsets correctly limits access, so this is not a correctness issue, but after a selective filter on a large array, all patches are still in memory. Worth a // TODO: trim indices and values to the patch range of lane_offsets[0]..lane_offsets[last] comment at least.


🟡 take_map builds the full patch hashmap even for selective takes

File: compute/take.rs, lines 92–107

for chunk in 0..n_chunks {
    for lane in 0..n_lanes {
        // ... inserts all patches into hashmap
    }
}

This scans all patches regardless of which elements are being taken. For a sparse take on a large array with many patches, it might be worth looping over the take indices first to identify which chunks are relevant, then only loading patches for those chunks. Not a blocker, but worth a // TODO for future optimization.


✅ Things that look good

  • The transpose / scalar_at lane computation is consistent (both use absolute_index % n_lanes, where absolute_index is relative to the pre-offset chunk start, and n_lanes always divides 1024, so chunk boundaries don't break lane alignment).
  • apply_patches_primitive correctly guards against both lower and upper out-of-bounds patches.
  • The SliceReduce implementation is elegant — slicing always to chunk boundaries and tracking the sub-chunk offset is exactly right.
  • The compat fixture is good, covering multi-chunk with patches in different chunks.
  • Test coverage is thorough for the happy paths: rstest parameterization in slice.rs is especially nice.

@a10y

This comment was marked as resolved.

a10y added 5 commits March 27, 2026 15:08
a10y added 18 commits March 27, 2026 15:08
We call append_to_builder on the inner first, then just do a single pass
and overwrite what it just wrote.

@a10y a10y force-pushed the aduffy/patched-array branch from 8648244 to a9ce27f Compare March 27, 2026 19:33
Comment on lines +533 to +552
pub fn reinterpret<V: Sized>(&self) -> &[V] {
    assert!(
        self.is_aligned(Alignment::of::<V>()),
        "ByteBuffer not properly aligned to {}",
        type_name::<V>()
    );

    assert_eq!(
        self.length % size_of::<V>(),
        0,
        "ByteBuffer length not a multiple of the value length"
    );

    let v_len = self.length / size_of::<V>();
    let v_ptr = self.bytes.as_ptr().cast::<V>();

    // SAFETY: we checked that alignment and length are suitable to treat this as a &[V].
    unsafe { std::slice::from_raw_parts(v_ptr, v_len) }
}
}
Contributor

did we never do this before?

Contributor Author

I couldn't find anything, no

Contributor

shall we add a few more of these with different types? e.g. string?

Contributor Author

Executing anything other than primitive would fail

pub(super) len: usize,

/// lane offsets. The PType of these MUST be u32
pub(super) lane_offsets: BufferHandle,
Contributor

are you sure we don't want to have the option to compress these?

Comment on lines +35 to +36
/// Number of 1024-element chunks. Pre-computed for convenience.
pub(super) n_chunks: usize,
Contributor

why store this?

Contributor

this will increase the stack size and maybe register pressure. Is it stored to avoid computing / 1024?

/// Number of lanes the patch indices and values have been split into. Each of the `n_chunks`
/// of 1024 values is split into `n_lanes` lanes horizontally, each lane having 1024 / n_lanes
/// values that might be patched.
pub(super) n_lanes: usize,
Contributor

this too

Contributor Author

This I think we need to store in case we change the default ptype -> lane mapping in the future

Contributor

I think this kind of change would require new arrays. The data layout would change

}

#[derive(Clone, prost::Message)]
pub struct PatchedMetadata {
Contributor

I wonder if you can pack this into a u128?

Contributor Author

hm, is there a reason to do that over a struct? can protobuf even encode u128?

Contributor

I mean using fewer and smaller fields

/// in a filter or slice it may be sliced to the nearest chunk boundary.
#[prost(uint64, tag = "1")]
pub(crate) inner_len: u64,

Contributor

This is used only to fetch

Contributor Author

what do you mean by this?

Contributor

I am asking why we store this; is it likely because the inner may have a different len from the resulting patched array?

Comment on lines +306 to +309
array.indices = children.remove(0);
array.values = children.remove(0);

Ok(())
Contributor

in debug builds, check the dtypes?

a10y added 4 commits March 30, 2026 09:10

Labels

changelog/feature A new feature


3 participants